Abstract

Identifying the hottest paths in the control flow graph of 
a routine can direct optimizations to portions of the code 
where most resources are consumed. This powerful method- 
ology, called path profiling, was introduced by Ball and 
Larus in the mid 90's [4] and has received considerable
attention in the last 15 years for its practical relevance. A
shortcoming of Ball-Larus path profiling was the inability 
to profile cyclic paths, making it difficult to mine interest- 
ing execution patterns that span multiple loop iterations. Pre- 
vious results, based on rather complex algorithms, have at- 
tempted to circumvent this limitation at the price of signifi- 
cant performance losses already for a small number of itera- 
tions. In this paper, we present a new approach to multiple- 
iterations path profiling, based on data structures built on 
top of the original Ball-Larus numbering technique. Our
approach makes it possible to profile all executed paths obtained as a
concatenation of up to k Ball-Larus acyclic paths, where k is
a user-defined parameter. An extensive experimental investi- 
gation on a large variety of Java benchmarks on the Jikes 
RVM shows that, surprisingly, our approach can be even 
faster than Ball-Larus due to fewer operations on smaller 
hash tables, producing compact representations of cyclic 
paths even for large values of k. 

Categories and Subject Descriptors C.4 [Performance of
Systems]: Measurement Techniques; D.2.2 [Software
Engineering]: Tools and Techniques - programmer workbench;
D.2.5 [Software Engineering]: Testing and Debugging -
diagnostics, tracing

General Terms Algorithms, Measurement, Performance. 

Keywords Profiling, dynamic program analysis, instru- 
mentation. 
1. Introduction

Path profiling is a powerful methodology for identifying
performance bottlenecks in a program. The approach consists
of associating performance metrics, usually frequency coun- 
ters, to paths in the control flow graph. Identifying hot paths 
can direct optimizations to portions of the code that could 
yield significant speedups. For instance, trace scheduling can 
improve performance by increasing instruction-level
parallelism along frequently executed paths [13]. The seminal
paper by Ball and Larus [4] introduced a simple and elegant
path profiling technique. The main idea was to implicitly 
number all possible acyclic paths in the control flow graph 
so that each path is associated with a unique compact path 
identifier (ID). The authors showed that path IDs can be ef- 
ficiently generated at runtime and can be used to update a 
table of frequency counters. Although in general the number 
of acyclic paths may grow exponentially with the graph size,
in typical control flow graphs this number is usually small 
enough to fit in current machine wordsizes, making this ap- 
proach very effective in practice. 

While the original Ball-Larus approach was restricted to 
acyclic paths obtained by cutting paths at loop back edges, 
profiling paths that span consecutive loop iterations is a de- 
sirable yet difficult task that can yield better optimization 
opportunities. Consider for instance the problem of
eliminating redundant executions of instructions, such as loads
and stores [7], conditional jumps [6], expressions [9], and
array bounds checks [8]. A typical situation is that the same
instruction is redundantly executed at each loop iteration, 
which is particularly common for arithmetic expressions and 
load operations [7, 9]. To identify such redundancies, paths
that extend across loop back edges need to be profiled.
Another application is trace scheduling [13]: if a frequently
executed cyclic path is found, compilers may unroll the loop and
perform trace scheduling on the unrolled portion of code. 
Tallam et al. [20] provide a comprehensive discussion of the
benefits of multi-iterations path profiling. 

Different authors have proposed techniques to profile
cyclic paths by modifying the original Ball-Larus path
numbering scheme in order to identify paths that extend
across multiple loop iterations [17, 19, 20]. Unfortunately,
all known solutions require rather complex algorithms that
incur severe performance overheads even for short cyclic
paths, leaving it as an interesting open question to find
simpler and more efficient alternative methods.

2013/4/19

Figure 1: Overview of our approach: classical Ball-Larus profiling
versus k-iteration path profiling, cast in a common framework.

Our results. In this paper, we present a novel approach to 
multiple-iterations path profiling, which provides substan- 
tially better performance than previous techniques even for 
long paths. Our method stems from the observation that any 
cyclic execution path in the control flow graph of a routine 
can be described as a concatenation of Ball-Larus acyclic 
paths (BL paths). In particular, we show how to accurately 
profile all executed paths obtained as a concatenation of up 
to k BL paths, where k is a user-defined parameter. To do
so, we reduce multiple-iterations path profiling to the
problem of counting n-grams, i.e., contiguous sequences of n
items from a given sequence. To compactly represent col- 
lected profiles, we organize them in a prefix tree (or trie) [14] 
of depth up to k where each node is labeled with a BL path, 
and paths in the tree represent concatenations of BL paths 
that were actually executed by the program, along with their 
frequencies. 

We implemented our ideas by developing a Java perfor- 
mance profiler in the Jikes Research Virtual Machine [1]. 
To make fair performance comparisons with state-of-the-art 
previous profilers, we built our code on top of the BLPP
profiler developed by Bond [10, 15], which provides an efficient
implementation of the Ball-Larus acyclic path profiling tech- 
nique. A broad experimental study on a large suite of promi- 
nent Java benchmarks on the Jikes Research Virtual Machine 
shows that our profiler can trace long paths efficiently, mak- 
ing it possible to collect profiles that would have been too 
costly to gather using previous multi-iterations techniques. 



procedure bl_path_numbering():
 1: for each basic block v in reverse topological order do
 2:   if v is the exit block then
 3:     numPaths(v) ← 1
 4:   else
 5:     numPaths(v) ← 0
 6:     for each outgoing edge e = (v, w) do
 7:       val(e) ← numPaths(v)
 8:       numPaths(v) += numPaths(w)
 9:     end for
10:   end if
11: end for



Figure 2: Ball-Larus path numbering algorithm. 



Techniques. Unlike previous approaches [17, 19, 20],
which rely on modifying the Ball-Larus path numbering
to cope with cycles, our method does not require any
modification of the original numbering technique described in [4].
The main idea behind our approach is to fully decouple the 
task of tracing Ball-Larus acyclic paths at run time from the 
task of concatenating and storing them in a data structure 
to keep track of multiple iterations. The decoupling is
performed by letting the Ball-Larus profiling algorithm issue
a stream of BL path IDs (see Figure 1), where each ID is
generated when a back edge in the control flow graph is tra- 
versed or the current procedure is abandoned. As a conse- 
quence of this modular approach, our method can be imple- 
mented on top of existing Ball-Larus path profilers, making 
it simpler to code and maintain. 

Our profiler introduces a technical shift based on a 
smooth blend of the path numbering methods used in in- 
traprocedural path profiling with data structure-based tech- 
niques typically adopted in interprocedural profiling, such 
as calling context profiling. The key to the efficiency of our 
approach is to replace costly hash table accesses, which are 
required by the Ball-Larus algorithm to maintain path coun- 
ters for non-small programs, with substantially faster op- 
erations on trees. With this idea, we can profile paths that 
extend across many loop iterations in time comparable to, if
not shorter than, that of profiling acyclic paths on a large variety of
industry-strength benchmarks. 

Organization of the paper. In Section 2 we describe our
approach and in Section 3 we discuss how to implement it.
The results of our experimental investigation are detailed
in Section 4 and related work is surveyed in Section 5.
Concluding remarks are given in Section 6.

2. Approach 

In this section we provide an overview of our approach to 
multiple-iterations path profiling. From a high level point of 
view, illustrated in Figure 1, the entire process is divided into
two main phases: 
Figure 3: Control flow graph with Ball-Larus instrumentation modified to emit acyclic path IDs to an output stream and running example of
our approach that shows a 4-iteration path forest (4-IPF) for a possible small execution trace.



1. instrumentation and execution of the program to be
profiled (top of Figure 1);

2. profiling of paths (bottom of Figure 1).

The first phase is almost identical to the original approach 
described in [4]. The target program is statically analyzed
and a control flow graph (CFG) is constructed for each rou- 
tine of interest. The CFG is used to instrument the original 
program by inserting probes, which make it possible to trace paths at
run time. When the program is executed, taken acyclic paths 
are identified using the inserted probes. The main difference 
with the Ball-Larus approach is that, instead of directly up- 
dating a frequency counters table here, we emit a stream of 
path IDs, which is passed along to the next stage of the pro- 
cess. This allows us to decouple the task of tracing taken 
paths from the task of profiling them. 

The profiling phase can be either the original hash table- 
based method of [4] used to maintain BL path frequencies 
(bottom-left of Figure 1), or other approaches such as the
one we propose, i.e., profiling concatenations of BL paths in
a forest-based data structure (bottom-right of Figure 1).
Different profiling methods can therefore be cast into a common
framework, increasing flexibility and helping us make more 
accurate comparisons. 

We start with a brief overview of the Ball-Larus path 
tracing technique, which we use as the first stage of our
profiler.

2.1 Ball-Larus Path Tracing Algorithm 

The Ball-Larus path profiling (BLPP) technique [4] identi- 
fies each acyclic path that is executed in a routine. Paths start 
on the method entry and terminate on the method exit. Since 
loops make the CFG cyclic, loop back edges are substituted 
by a pair of dummy edges: the first one from the method en- 
try to the target of the loop back edge, and the second one 
from the source of the loop back edge to the method exit. 



After this (reversible) transformation, the CFG of a method 
becomes a DAG (directed acyclic graph) and acyclic paths 
can be enumerated. 

The Ball-Larus path numbering algorithm, shown in
Figure 2, assigns a value val(e) to each edge e of the CFG such
that, given N acyclic paths, the sum of the edge values along
any entry-to-exit path is a unique numeric ID in [0, N-1]. A
CFG example and the corresponding path IDs are shown in
Figure 3: notice that there are eight distinct acyclic paths,
numbered from 0 to 7, starting either at the method's entry
A, or at loop header B (target of back edge (E, B)).
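
To make the numbering scheme concrete, the following Python sketch implements the procedure of Figure 2 on a small DAG. The DAG, the function names `bl_path_numbering` and `path_ids`, and the dictionary-based graph encoding are illustrative assumptions and do not reproduce the CFG of Figure 3 or any code from the profiler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def bl_path_numbering(succ, exit_block):
    """Return (val, num_paths) for a DAG given as {block: [successor blocks]}."""
    # TopologicalSorter interprets the mapped lists as predecessors, so passing
    # successors makes static_order() yield blocks in reverse topological order.
    val, num_paths = {}, {}
    for v in TopologicalSorter(succ).static_order():
        if v == exit_block:
            num_paths[v] = 1
        else:
            num_paths[v] = 0
            for w in succ[v]:
                val[(v, w)] = num_paths[v]      # value assigned to edge (v, w)
                num_paths[v] += num_paths[w]
    return val, num_paths


def path_ids(succ, val, block, exit_block, r=0):
    """List the IDs of all paths from block to exit_block by summing edge values."""
    if block == exit_block:
        return [r]
    ids = []
    for w in succ[block]:
        ids += path_ids(succ, val, w, exit_block, r + val[(block, w)])
    return ids


# Hypothetical DAG (not the CFG of Figure 3): a diamond followed by an if-then.
DAG = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["F"], "F": []}
VAL, NUM_PATHS = bl_path_numbering(DAG, "F")
```

On this DAG, `NUM_PATHS["A"]` is 4 and the four entry-to-exit paths receive the distinct IDs 0 through 3, which is the property the numbering is designed to guarantee.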

BLPP places instrumentation on edges to compute a 
unique path number for each possible path. In particular, 
it uses a variable r, called probe or path register, to compute 
the path number. Variable r is first initialized to zero upon 
method entry and then is updated as edges are traversed. 
When an edge that reaches the method exit is executed, or 
a back edge is traversed, variable r represents the unique 
ID of the taken path. As observed, instead of using the path 
ID r to increase the path frequency counter (count[r]++),
we defer the profiling stage by emitting the path ID to an 
output stream (emit r). To support profiling over multiple 
invocations of the same routine, we annotate the stream with 
the special marker * to denote a routine entry event.
Instrumentation code for our CFG example is shown on the left of
Figure 3.

2.2 k-Iterations Path Profiling

The second stage of our profiler takes as input the stream of 
BL path IDs generated by the first stage and uses it to build 
a data structure that keeps track of the frequencies of each 
and every distinct taken path consisting of the concatenation 
of up to k BL paths, where k is a user-defined parameter.
This problem is equivalent to counting all n-grams, i.e.,
contiguous sequences of n items from a given sequence of
items, for each $n \le k$. Our solution is based on the notion
of prefix forest, which compactly encodes a list of sequences
by representing repetitions and common prefixes only once.
A prefix forest can be defined as follows:

Definition 1 (prefix forest). Let $L = \langle x_1, x_2, \ldots, x_q \rangle$ be
any list of finite-length sequences over an alphabet $H$. A
prefix forest $\mathcal{F}(L)$ of $L$ is a minimal labeled forest such that,
for each $x_i = \langle a_1, a_2, \ldots, a_n \rangle \in L$ there is a path
$\pi_i = \langle \alpha_1, \alpha_2, \ldots, \alpha_n \rangle$ in $\mathcal{F}(L)$ where each node $\alpha_j$, $j \in [1, n]$:

1. is labeled with $a_j$, i.e., $\ell(\alpha_j) = a_j \in H$;

2. has an associated counter $c(\alpha_j)$ that counts the number
of times $\langle a_1, a_2, \ldots, a_j \rangle \subseteq x_i$ occurs in $L$.

k-Iterations Path Forest. The output of our profiler is a
prefix forest, which we call k-Iterations Path Forest (k-IPF),
that compactly represents all observed contiguous sequences
of up to k BL path IDs:

Definition 2 (k-Iterations Path Forest). Given an input
stream $\Sigma$ representing a sequence of BL path IDs and *
markers, the k-Iterations Path Forest (k-IPF) of $\Sigma$ is defined
as k-IPF $= \mathcal{F}$(list of all n-grams of $\Sigma$ that do not contain *,
with $n \le k$).

By Definition 2, the k-IPF is the prefix forest of all
consecutive subsequences of up to k BL path IDs in $\Sigma$.

Example 1. Figure 3 provides an example showing the
4-IPF constructed for a small sample execution trace
consisting of a sequence of 44 basic blocks encountered during one
invocation of the routine described by the control flow graph
on the left. Notice that the full (cyclic) execution path starts
from the entry basic block A and terminates on the exit
basic block F. The first stage of our profiler issues a stream $\Sigma$
of BL path IDs that are obtained by emitting the value of the
probe register r each time a back edge is traversed, or the exit
basic block is executed. Observe that the sequence of
emitted path IDs induces a partition of the execution path into
Ball-Larus acyclic paths. Hence, the sequence of executed
basic blocks can be fully reconstructed from the sequence $\Sigma$
of path IDs.

The 4-IPF built in the second stage contains exactly 
one tree for each of the 4 distinct BL path IDs (0, 2, 3, 
6) that occur in the stream. Notice that path frequencies 
in the first level of the 4-IPF are exactly those that tradi- 
tional Ball-Larus profiling would collect. The second level 
contains the frequencies of taken paths obtained by
concatenating 2 BL paths, etc. Notice that the path labeled
with $\langle 2, 0, 0, 2 \rangle$ in the 4-IPF, which corresponds to the path
$\langle B, C, E, B, D, E, B, D, E, B, C, E \rangle$ in the control flow
graph, is a 4-gram that occurs 3 times in $\Sigma$ and is one of
the most frequent paths among those that span from 2 up to
4 loop iterations.
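
As a reference point for Definition 2, the counters stored in a k-IPF can be obtained by brute force, counting every n-gram with n up to k directly. The Python sketch below does so with a hash-based counter; it is not the algorithm used by the profiler, and the name `count_ngrams` is hypothetical. The ID sequence in `STREAM` is an assumption: it is the one implied by the running example (14 BL path IDs over the four distinct IDs 0, 2, 3, 6), preceded by a * marker for the routine entry event.

```python
from collections import Counter


def count_ngrams(stream, k, marker="*"):
    """Count every n-gram (1 <= n <= k) of path IDs that does not span a marker."""
    counts = Counter()
    run = []                          # IDs seen since the last routine entry marker
    for item in stream:
        if item == marker:
            run = []
            continue
        run.append(item)
        # Record each n-gram that ends at the current item.
        for n in range(1, min(k, len(run)) + 1):
            counts[tuple(run[-n:])] += 1
    return counts


# Assumed ID sequence for the running example (one routine invocation).
STREAM = ["*", 6, 2, 0, 0, 2, 2, 0, 0, 2, 2, 0, 0, 2, 3]
COUNTS = count_ngrams(STREAM, 4)
```

With this input, `COUNTS[(2, 0, 0, 2)]` is 3, matching the frequency quoted in Example 1, and the 1-grams are exactly the counters that a Ball-Larus profiler would collect. Each item triggers up to k hash table updates, which is the kind of cost the forest-based approach is meant to avoid.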

Figure 4: Subtree of the 11-IPF of method org.eclipse.jdt.
internal.compiler.parser.Scanner.checkTaskTag taken
from release 2006-MR2 of the DaCapo benchmark suite. Each node
shows a BL path ID and its frequency counter.

Properties. A k-IPF has some relevant properties:

1. $\forall$ node $\alpha \in$ k-IPF, $k > 0$: $c(\alpha) \ge \sum_{\beta_i : (\alpha, \beta_i) \in k\text{-IPF}} c(\beta_i)$

2. $\forall k > 0$, k-IPF $\subseteq$ (k + 1)-IPF

By Property 1, since path counters are non-negative, they 
are monotonically non-increasing as we walk down the tree. 
The inequality $\ge$ in Property 1 may be strict (>) if the
execution trace of a routine invocation does not end at the 
exit basic block; this may be the case when a subroutine call 
is performed at an internal node of the CFG. Notice that a 1- 
IPF includes only acyclic paths and yields exactly the same 
counters as a Ball-Larus profiler [4]. 

Example 2. In Figure 4 we show a subtree of the 11-IPF
generated for method checkTaskTag of class Scanner
in the org.eclipse.jdt.internal.compiler.parser
package of the eclipse benchmark included in the DaCapo
release 2006-MR2. In the subtree, we pruned all nodes
with counters less than 10% of the counter of the root.
Notice that, after executing the BL path with ID 38, 66%
of the times the program executes path 86, and 28% of
the times BL path 87. When 86 follows 38, 100% of the
times the control flow takes the path $\langle 86, 86, 86, 755 \rangle$, which
spans four loop iterations and may be successfully
unrolled to perform trace scheduling. Interestingly, sequence
$\langle 38, 86, 86, 86, 755, 38, 86, 86, 86, 755 \rangle$ of 11 BL path IDs,
highlighted in Figure 4, accounts for more than 50% of all
executions of the first BL path in the sequence, showing that
sequence $\langle 38, 86, 86, 86, 755 \rangle$ is likely to be repeated
consecutively more than once.

Figure 5: 4-SF resulting from the execution trace of Figure 3.

2.3 Algorithms 

In this section, we show how to efficiently construct a k-IPF
profile starting from a stream of BL path IDs. The main
idea is to construct an intermediate data structure that can
be updated quickly, and then convert this data structure into
a k-IPF more efficiently when the stream is over. As
intermediate data structure, we use a variant of the k-slab forest
(k-SF) introduced in [3], which we adapt to our context as
follows:

Definition 3 (k-slab forest). Let $k \ge 2$ and let $s_1, s_2, s_3, \ldots, s_m$
be the subsequences of $\Sigma$ obtained by: (1) splitting $\Sigma$ at
* markers, (2) removing the markers, and (3) cutting the
remaining subsequences every $k - 1$ consecutive items. The
k-slab forest (k-SF) of $\Sigma$ is defined as k-SF $= \mathcal{F}$(list of all
prefixes of $s_1 \cdot s_2$ and all prefixes of length $\ge k$ of $s_i \cdot s_{i+1}$,
$\forall i \in [2, m - 1]$), where $s_i \cdot s_{i+1}$ denotes the concatenation
of $s_i$ and $s_{i+1}$.

By Definition 3, since each $s_i$ has length up to $k - 1$, a
k-SF has at most $2k - 2$ levels and depth $2k - 3$.

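Example 3. Let us consider again the example given in
Figure 3. For k = 4, we break the stream into maximal
subsequences of up to k - 1 = 3 consecutive BL path IDs:

$\Sigma = \langle \underbrace{6, 2, 0}_{s_1}, \underbrace{0, 2, 2}_{s_2}, \underbrace{0, 0, 2}_{s_3}, \underbrace{2, 0, 0}_{s_4}, \underbrace{2, 3}_{s_5} \rangle$

The 4-SF of $\Sigma$, defined in terms of $s_1, \ldots, s_5$, is shown
in Figure 5. The forest is obtained as $\mathcal{F}(L)$, where $L = \langle$
$\langle 6 \rangle$, $\langle 6, 2 \rangle$, $\langle 6, 2, 0 \rangle$, $\langle 6, 2, 0, 0 \rangle$, $\langle 6, 2, 0, 0, 2 \rangle$, $\langle 6, 2, 0, 0, 2, 2 \rangle$,
$\langle 0, 2, 2, 0 \rangle$, $\langle 0, 2, 2, 0, 0 \rangle$, $\langle 0, 2, 2, 0, 0, 2 \rangle$,
$\langle 0, 0, 2, 2 \rangle$, $\langle 0, 0, 2, 2, 0 \rangle$, $\langle 0, 0, 2, 2, 0, 0 \rangle$,
$\langle 2, 0, 0, 2 \rangle$, $\langle 2, 0, 0, 2, 3 \rangle$ $\rangle$.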

k-SF Construction Algorithm. Given a stream $\Sigma$ formed
by * markers and BL path IDs, the k-SF of $\Sigma$ can be
constructed by calling the procedure process_bl_path_id(r)
shown in Figure 6 on each item r of $\Sigma$. The streaming
algorithm, which is a variant of the k-SF construction algorithm
given in [3] for the different setting of bounded-length
calling contexts, keeps the following information:



procedure process_bl_path_id(r):
 1: if r = * then
 2:   n ← 0
 3:   τ ← null
 4:   return
 5: end if
 6: if n mod (k − 1) = 0 then
 7:   β ← τ
 8:   τ ← find(R, r)
 9:   if τ = null then
10:     add root τ with ℓ(τ) = r and c(τ) = 0 to k-SF and R
11:   end if
12: else
13:   find child ω of node τ with label ℓ(ω) = r
14:   if ω = null then
15:     add node ω with ℓ(ω) = r and c(ω) = 0 to k-SF
16:     add arc (τ, ω) to k-SF
17:   end if
18:   τ ← ω
19: end if
20: if β ≠ null then
21:   find child υ of node β with label ℓ(υ) = r
22:   if υ = null then
23:     add node υ with ℓ(υ) = r and c(υ) = 0 to k-SF
24:     add arc (β, υ) to k-SF
25:   end if
26:   β ← υ
27:   c(β) ← c(β) + 1
28: else
29:   c(τ) ← c(τ) + 1
30: end if
31: n ← n + 1

Figure 6: Streaming algorithm for k-SF construction.



• a hash table R, initially empty, containing pointers to the
roots of trees in the k-SF, hashed by node labels; since
no two roots have the same label, the lookup operation
find(R, r) returns the pointer to the root containing label
r, or null if no such root exists;

• a variable n that counts the number of BL path IDs
processed since the last * marker;

• a variable $\tau$ (top) that points either to null, or to the
current k-SF node in the upper part of the forest (levels
0 through $k - 2$);

• a variable $\beta$ (bottom) that points either to null, or to the
current k-SF node in the lower part of the forest (levels
$k - 1$ through $2k - 3$).

The main idea of the algorithm is to progressively add new
paths to an initially empty k-SF. The path formed by the first
$k - 1$ items since the last * marker is added to one tree of
the upper part of the forest. Each later item r is added at up
to two different locations of the k-SF: one in the upper part
of the forest (lines 13-17) as a child of node $\tau$ (if no child
of $\tau$ labeled with r already exists), and the other one in the
lower part of the forest (lines 21-25) as a child of node $\beta$ (if
no child of $\beta$ labeled with r already exists). Counters of
processed nodes already containing r are incremented by one
(either line 27 or line 29). Both $\tau$ and $\beta$ are updated to point
to the child labeled with r (lines 18 and 26, respectively).
The running time of the algorithm is dominated by lines 8
and 10 (hash table accesses), and by lines 13 and 21 (node
children scan). Assuming that operations on R require
constant time, the per-item processing time is $O(\delta)$, where $\delta$ is
the maximum degree of a node in the k-SF. Our experiments
revealed that, on average, $\delta$ is typically a small constant.

procedure make_k_ipf():
 1: I ← ∅
 2: for each node ρ ∈ k-SF do
 3:   if ℓ(ρ) ∉ I then
 4:     add ℓ(ρ) to I and let s(ℓ(ρ)) ← ∅
 5:   end if
 6:   add ρ to s(ℓ(ρ))
 7: end for
 8: let the k-IPF be formed by a dummy root φ
 9: for each r ∈ I do
10:   for each ρ ∈ s(r) do
11:     join_subtree(ρ, φ, k)
12:   end for
13: end for
14: remove dummy root φ from the k-IPF

procedure join_subtree(ρ, γ, d):
 1: δ ← child of γ in the k-IPF s.t. ℓ(δ) = ℓ(ρ)
 2: if δ = null then
 3:   add new node δ as a child of γ in the k-IPF
 4:   ℓ(δ) ← ℓ(ρ) and c(δ) ← c(ρ)
 5: else
 6:   c(δ) ← c(δ) + c(ρ)
 7: end if
 8: if d > 1 then
 9:   for each child σ of ρ in the k-SF do
10:     join_subtree(σ, δ, d − 1)
11:   end for
12: end if

Figure 7: Algorithm for converting a k-SF into a k-IPF.

k-SF to k-IPF Conversion. Once the stream $\Sigma$ is over, i.e.,
the profiled thread has terminated, we convert the k-SF into
a k-IPF using the procedure make_k_ipf shown in Figure 7.
The algorithm creates a set $I$ of all distinct path IDs that
occur in the k-SF and for each r in $I$ builds a set $s(r)$
containing all nodes $\rho$ of the k-SF labeled with r (lines
2-7). To build the k-IPF, the algorithm lists each distinct
path ID r and joins to the k-IPF all subtrees of depth up to
$k - 1$ rooted at a node in $s(r)$ in the k-SF, as children of a
dummy root, which is added for the sake of convenience and
then removed. The join operation is specified by procedure
join_subtree, which performs a traversal of a subtree of
the k-SF of depth less than k and adds nodes to the k-IPF so that
all labeled paths in the subtree appear in the k-IPF as well,
but only once. Path counters in the k-SF are accumulated
in the corresponding nodes of the k-IPF to keep track of
the number of times each distinct path consisting of the
concatenation of up to k BL paths was taken by the profiled
program.
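
The two stages described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the profiler's implementation: children are kept in a dictionary instead of being scanned in a list, the conversion visits k-SF nodes directly instead of first grouping them by label (the resulting counters are the same), and the names `Node`, `build_k_sf`, `make_k_ipf` and `frequency` are hypothetical. The ID sequence in `STREAM` is the one assumed for the running example, with a leading * marker for the routine entry event.

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.count = 0
        self.children = {}                 # label -> Node

    def child(self, label):
        """Return the child with this label, creating it if it is missing."""
        if label not in self.children:
            self.children[label] = Node(label)
        return self.children[label]


def build_k_sf(stream, k, marker="*"):
    """Streaming k-SF construction; returns the root table R (label -> root)."""
    assert k >= 2
    roots = {}
    n, top, bottom = 0, None, None
    for r in stream:
        if r == marker:                    # routine entry: restart slab counting
            n, top = 0, None
            continue
        if n % (k - 1) == 0:               # slab boundary: one root-table lookup
            bottom = top
            if r not in roots:
                roots[r] = Node(r)
            top = roots[r]
        else:
            top = top.child(r)
        if bottom is not None:             # extend and count the lower-part path
            bottom = bottom.child(r)
            bottom.count += 1
        else:                              # first slab of the invocation
            top.count += 1
        n += 1
    return roots


def join_subtree(p, parent, d):
    """Merge the k-SF subtree rooted at p, cut to d levels, under parent."""
    delta = parent.child(p.label)
    delta.count += p.count
    if d > 1:
        for sigma in p.children.values():
            join_subtree(sigma, delta, d - 1)


def make_k_ipf(roots, k):
    """Convert a k-SF into a k-IPF; returns label -> root of each k-IPF tree."""
    dummy = Node(None)
    stack = list(roots.values())
    while stack:                           # visit every k-SF node once
        p = stack.pop()
        stack.extend(p.children.values())
        join_subtree(p, dummy, k)
    return dummy.children


def frequency(ipf, path):
    """Counter of the k-IPF node reached by following path, or 0 if absent."""
    node = ipf.get(path[0])
    for label in path[1:]:
        if node is None:
            return 0
        node = node.children.get(label)
    return node.count if node else 0


# Assumed ID sequence for the running example (one routine invocation).
STREAM = ["*", 6, 2, 0, 0, 2, 2, 0, 0, 2, 2, 0, 0, 2, 3]
IPF = make_k_ipf(build_k_sf(STREAM, 4), 4)
```

For this input, `frequency(IPF, (2, 0, 0, 2))` is 3 and the first-level counters are those of the individual BL paths, which agrees with a direct n-gram count of the same sequence. Note that the root table is touched only once per slab of k - 1 items, while every other step is a tree operation.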

3. Implementation 

In this section we describe the implementation of our
profiler, which we call k-BLPP, in the Jikes Research Virtual
Machine [1].

3.1 Adaptive Compilation 

The Jikes RVM is a high performance metacircular virtual 
machine: unlike most other JVMs, it is written in Java.
Jikes RVM does not include an interpreter: all bytecode 
must be first translated into native machine code. The unit of 
compilation is the method, and methods are compiled lazily 
by a fast non-optimizing compiler - the so-called baseline 
compiler - when they are first invoked by the program. 
As execution continues, the Adaptive Optimization System 
monitors program execution to detect program hot spots and 
selectively recompiles them with three increasing levels of 
optimization. Note that all modern production JVMs rely on 
some variant of selective optimizing compilation to target 
the subset of the hottest program methods where they are 
expected to yield the most benefits. 

Recompilation is performed by the optimizing compiler,
which generates higher-quality code but at a significantly
larger cost than the baseline compiler. Since Jikes RVM
quickly recompiles frequently executed methods, we
implemented k-BLPP in the optimizing compiler only.

3.2 Inserting Instrumentation on Edges 

k-BLPP adds instrumentation to hot methods in three passes:

1. building the DAG representation;

2. assigning values to edges; 

3. adding instrumentation to edges. 

k-BLPP adopts the smart path numbering algorithm
proposed by Bond and McKinley [11] to improve performance
by placing instrumentation on cold edges. In particular, line
6 of the canonical Ball-Larus path numbering algorithm
shown in Figure 2 is modified such that outgoing edges are
picked in decreasing order of execution frequency. For each
basic block, edges are sorted using existing edge profiling
information collected by the baseline compiler, thus allowing
us to assign zero to the hottest edge so that k-BLPP does not
place any instrumentation on it.

During compilation, the Jikes RVM generates yield points,
which are program points where the running thread
determines if it should yield to another thread. Since JVMs need
to gain control of threads quickly, compilers insert yield
points in method prologues, loop headers, and method
epilogues. We modified the optimizing compiler to also store
the path profiling probe on loop headers and method
epilogues. Ending paths at loop headers rather than back edges
causes a path that traverses a header to be split into two
paths: this difference from canonical Ball-Larus path
profiling is minor because it only affects the first path through a
loop [10].

Note that optimizing compilers do not always insert yield 
points: this occurs when a method either does not con- 
tain branches (hence its profile is trivial) or is marked as 
uninterruptible. The second case occurs in internal Jikes 
RVM methods only; the compiler occasionally inlines such 
a method into an application method, and this might result 
in a loss of information only when the execution reaches a 
loop header contained in the inlined method. However, this 
loss of information appears to be negligible [10].

3.3 Path Profiling 

The k-SF construction algorithm described in Section 2.3
is implemented using a standard first-child, next-sibling 
representation for nodes: this representation is very space- 
efficient, while experimental results show that the average 
degree of a node is usually low. 

Tree roots are stored and accessed through an efficient 
stripped-down implementation of a hash map, using the pair 
represented by the Ball-Larus path ID and the unique iden- 
tifier associated to the current routine as key. Note that this 
map is typically smaller than a map required by a traditional 
BLPP profiler, since tree roots represent only a fraction of 
the distinct path IDs encountered during the execution. Con- 
sider, for instance, a routine with N acyclic paths whose con- 
trol flow graph contains a common and unique binary branch 
before the first cycle is entered: since cyclic paths are trun- 
cated on loop headers, only two distinct path IDs can appear 
as a tree root in the hash map, while the remaining N - 2
paths can appear only inside non-root nodes. 

4. Experimental Evaluation 

In this section we report the results of an extensive
experimental evaluation of our approach. The goal is to assess
the performance of our profiler compared to previous ap- 
proaches and to study properties of path profiles that span 
multiple iterations for several representative benchmarks. 

4.1 Experimental Setup 

Benchmarks. We evaluated k-BLPP against a variety of
prominent benchmarks drawn from three suites. The DaCapo
suite [5] consists of a set of open source, real-world
applications with non-trivial memory loads. We use the
superset of all benchmarks from DaCapo releases 2006-MR2 and
9.12 that can run successfully with Jikes RVM, using the
largest available workload for each benchmark. The SPEC
suite focuses on the performance of the hardware
processor and memory subsystem when executing common
general purpose application computation (see footnote 1). Finally, we chose
two memory-intensive benchmarks from the Java Grande
2.0 suite [12] to further evaluate the performance of k-BLPP.

Compared Codes. In our experiments, we analyzed the
native (uninstrumented) version of each benchmark and its
instrumented counterparts, comparing k-BLPP for different
values of k (2, 3, 4, 6, 8, 11, 16) with an updated version
of the BLPP profiler developed by Bond [10, 15], which
implements the Ball-Larus acyclic path profiling technique.

Platform. Our experiments were performed on a 2.53GHz 
Intel Core2 Duo T9400 with 128KB of L1 data cache, 6MB
of L2 cache, and 4 GB of main memory DDR3 1066, run- 
ning Ubuntu 12.10, Linux Kernel 3.5.0, 32 bit. We ran all of 
the benchmarks on Jikes RVM 3.1.3 (default production 
build) using a single core and a maximum heap size equal to 
half of the amount of physical memory. 

Metrics. We considered a variety of metrics, including 
wall-clock time, number of operations per second performed 
by the profiled program, number of hash table operations, 
data structure size (e.g., number of hash table items for
BLPP and number of k-SF nodes for k-BLPP), and statistics
such as average node degree of the k-SF and the k-IPF and
average depth of k-IPF leaves. To interpret our results, we
also "profiled our profiler" by collecting hardware
performance counters with perf [18], including L1 and L2 cache
miss rate, branch mispredictions, and cycles per instruction
(CPI). 

Methodology. For each benchmark/profiler combination, 
we performed at least 7 trials, each preceded by a warmup 
execution, and computed the arithmetic mean. We monitored 
variance, increasing the number of trials for problematic 
benchmarks. Performance measurements were collected on 
a machine with negligible background activity. 

4.2 Experimental Results 

Performance overhead. In Figure 8 we report for each
benchmark the profiling overhead of k-BLPP relative to
BLPP. The chart shows that for 12 out of 16 benchmarks
the overhead decreases for increasing values of k,
providing improvements of up to almost 50% over BLPP. This is
explained by the fact that hash table accesses are performed by
process_bl_path_id every k - 1 items read from the input
stream between two consecutive routine entry events (lines
8 and 10 in Figure 6). As a consequence, the number of hash
table operations for each routine call is O(1 + N/(k - 1)),
where N is the total length of the path taken during the
invocation. In Figure 9 we report the measured number of hash
table accesses for our experiments, which decreases as
predicted on all benchmarks with intense loop iteration activity.
Notice that not only does k-BLPP perform fewer hash table
operations, but since only a subset of BL path IDs are inserted,



' Unfortunately, only a few benchmarks from SPEC JVM2008 can run 
successfully with Jikes RVM due to limitations of the GNU classpath. 
Figure 8: Performance of k-BLPP relative to BLPP.
Figure 9: Number of hash table operations performed by k-BLPP relative to BLPP.



the table is also smaller, yielding further performance
improvements. For codes such as avrora and hsqldb, which
perform on average a small number of iterations, increasing 
k beyond this number does not yield any benefit. 
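
As a rough illustration of the O(1 + N/(k - 1)) bound, the following sketch counts hash table accesses under a simplified model. The model is an assumption made for this example and is not the actual process_bl_path_id code: one access is charged for the first BL path ID emitted after a routine entry, and one more every k - 1 IDs thereafter. The function name hash_accesses is hypothetical.

```python
def hash_accesses(n_path_ids: int, k: int) -> int:
    """Count hash table accesses for one routine invocation.

    Simplified model (an assumption, not the paper's code): the IDs at
    positions 0, k-1, 2(k-1), ... of the invocation's stream each trigger
    one hash table access; all other IDs are handled without hashing.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    accesses = 0
    for position in range(n_path_ids):
        if position % (k - 1) == 0:
            accesses += 1
    return accesses


if __name__ == "__main__":
    for k in (2, 4, 16):
        print(k, hash_accesses(30, k))
```

Under this model, an invocation that emits 30 BL path IDs costs 30 accesses for k = 2, 10 for k = 4, and 2 for k = 16, which matches the trend of fewer hash operations as k grows on loop-intensive code.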

On eclipse, k-BLPP gets faster as k increases but, unlike
all other benchmarks in this class, remains
slower than BLPP. The reason is that, due to structural
properties of the benchmark, the average number of node scans
at lines 13 and 21 of process_bl_path_id is rather high
(58.8 for k = 2 down to 10.3 for k = 16). In contrast, the
average degree of internal nodes of the k-SF is small (2.6 for
k = 2 decreasing to 1.3 for k = 16), hence there is intense
activity on nodes with a high number of siblings. No other
benchmark exhibited this extreme behavior. We expect that
a more efficient implementation of process_bl_path_id,
e.g., by adaptively moving hot children to the front of the list,
could reduce the scanning overhead for this kind of
worst-case benchmark as well.
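
The move-to-front idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the Node class, its fields, and the method find_or_add_child are hypothetical names and do not come from our implementation, and children are assumed to be stored in a linearly scanned list.

```python
class Node:
    """A tree node whose children are kept in a linearly scanned list."""

    def __init__(self, path_id):
        self.path_id = path_id
        self.counter = 0
        self.children = []

    def find_or_add_child(self, path_id):
        """Return (child, scans), where scans is the number of children
        compared against path_id during the lookup.

        A child that is found is moved to the front of the list, so that
        frequently requested children are reached after fewer scans.
        A missing child is created and appended at the end.
        """
        for i, child in enumerate(self.children):
            if child.path_id == path_id:
                if i > 0:
                    self.children.insert(0, self.children.pop(i))
                return child, i + 1
        child = Node(path_id)
        self.children.append(child)
        return child, len(self.children) - 1


if __name__ == "__main__":
    root = Node(None)
    for pid in (1, 2, 3):
        root.find_or_add_child(pid)
    print(root.find_or_add_child(3)[1])  # first lookup scans the whole list
    print(root.find_or_add_child(3)[1])  # second lookup finds it at the front
```

Move-to-front is a standard self-organizing list heuristic. Whether it pays off here depends on how skewed the accesses to siblings are, which we have not measured.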

Benchmarks compress, scimark.monte_carlo, heapsort,
and md are exceptions to the general trend we
observed, with performance overhead increasing, rather than
decreasing, with k. To explain this behavior, we collected
and analyzed several hardware performance counters and
noticed that on these benchmarks our k-BLPP
implementation suffers from increased CPI for higher values of k.
Figure 10(a) shows this phenomenon, comparing the four
outliers with other benchmarks in our suite. By analyzing L1
and L2 cache miss rates, reported in Figure 10(b) and
Figure 10(c), we noticed that performance degrades due to poor
memory access locality. We believe this to be an issue of
our current implementation of k-BLPP, in which we made
no effort to improve cache efficiency, rather
than a limitation of the general approach we propose.

Figure 11: Space requirements: number of hash table entries in BLPP and number of nodes in the k-SF.
Figure 13: Average degree of k-IPF internal nodes.
Figure 14: Average depth of k-IPF leaves.

Space Usage. Figure 11 compares the space requirements
of BLPP and k-BLPP for different values of k. The chart
reports the total number of items stored in the hash table
by BLPP and the number of nodes in the k-SF. Since both
BLPP and k-BLPP exhaustively encode exact counters for
all distinct taken paths of bounded length, space depends on
intrinsic structural properties of the benchmark. Programs
with intense loop iteration activity are characterized by
substantially higher space requirements for k-BLPP, which
collects profiles containing up to several millions of paths.
Notice that on some benchmarks we ran out of memory for
large values of k, hence some bars in the charts we report
in this section are missing. In Figure 12 we report the
number of nodes in the k-IPF, which corresponds to the number
of paths profiled by k-BLPP. Notice that, since a path may
be represented more than once in the k-SF, the k-IPF
represents a more compact version of the k-SF.

Structural Properties of Collected Profiles. As a final
experiment, we measured structural properties of the k-IPF
such as the average degree of internal nodes (Figure 13) and the
average leaf depth (Figure 14). Our tests reveal that the
average node degree generally decreases with k, showing that
similar patterns tend to appear frequently across different
iterations. Some benchmarks, however, such as sunflow and
heapsort, exhibit a larger variety of path ramifications,
witnessed by increasing node degrees at deeper levels of the
k-IPF. The average leaf depth characterizes the
loop iteration activity of different benchmarks. Notice that
some benchmarks, such as avrora and hsqldb, have short
cycles. Hence, by increasing k beyond the maximum cycle
length, k-BLPP does not collect any additional information.

Table 1: Comparison of different path profiling techniques.

Discussion. From our experiments, we could draw two 
main conclusions: 

1. Using tree-based data structures to represent
intraprocedural control flow makes it possible to substantially reduce the
performance overhead of path profiling by decreasing the
number of hash operations, which also operate on smaller
tables. This approach yields the first profiler that can
handle paths that extend across multiple loop iterations faster
than the general Ball-Larus technique based on hash
tables for maintaining path frequency counters, while
collecting at the same time significantly more informative
profiles. We observed that, due to limitations of our
current implementation of k-BLPP such as lack of cache
friendliness for some worst-case scenarios, on a few
outliers our profiler was slower than Ball-Larus, with a peak
of 3.5x slowdown on one benchmark.

2. Since the number of profiled paths in the control flow
graph typically grows exponentially for increasing
values of k, space usage can become prohibitive if paths
spanning many loop iterations have to be exhaustively
profiled. We noticed, however, that most long paths have
small frequency counters, and are therefore uninteresting
for identifying optimization opportunities. Hence, a
useful addition to our method, which we do not address in
this work, would be to prune cold nodes on-the-fly from
the k-SF, keeping information for hot paths only.
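
The pruning idea in the second point can be illustrated with a small sketch. It is not part of our method or implementation: the Node class, the function names, and the threshold parameter are hypothetical, and the sketch assumes a tree in which a child's counter never exceeds its parent's, so that discarding a cold subtree only drops paths at least as cold as its root. It shows a single pruning pass, not a full on-the-fly policy.

```python
class Node:
    """Hypothetical prefix-tree node holding a path frequency counter."""

    def __init__(self, path_id, counter=0):
        self.path_id = path_id
        self.counter = counter
        self.children = []


def subtree_size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(subtree_size(c) for c in node.children)


def prune_cold(node, threshold):
    """Discard every subtree whose root counter is below threshold.

    Returns the number of nodes removed. The node passed in is never
    removed itself; only its descendants are candidates.
    """
    removed = 0
    kept = []
    for child in node.children:
        if child.counter < threshold:
            removed += subtree_size(child)
        else:
            removed += prune_cold(child, threshold)
            kept.append(child)
    node.children = kept
    return removed


if __name__ == "__main__":
    root = Node(None)
    hot, cold = Node(1, 100), Node(2, 2)
    hot.children = [Node(3, 90), Node(4, 3)]
    cold.children = [Node(5, 2)]
    root.children = [hot, cold]
    print(prune_cold(root, threshold=10))
```

In the example, the pass removes three nodes: the cold child of the hot node and the two-node cold subtree under the root.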

5. Related Work 

The seminal work of Ball and Larus [4] has spawned much
research interest in the last 15 years, in particular on pro- 
filing acyclic paths with a lower overhead by using sam- 
pling techniques Oldlllh or choosing a subset of interesting 



paths yi lla, l2lh . On the other hand, only a few works have 
dealt with cyclic path profiling.

Tallam et al. [20] extend the Ball-Larus path numbering
algorithm to record slightly longer paths across loop back
edges and procedure boundaries. The extended Ball-Larus
paths overlap and, in particular, are shorter than two
iterations for paths that cross loop boundaries. These
overlapping paths enable very precise estimation of frequencies of
potentially much longer paths, with an average imprecision
in estimated total flow of those paths ranging from -4% to
+8%. However, the cost of collecting frequencies
of overlapping paths is on average 4.2 times that of
canonical BLPP.

Roy and Srikant [19] generalize the Ball-Larus algorithm
for profiling k-iteration paths, showing that it is possible
to number these paths efficiently using an inference phase
to record executed backedges in order to differentiate cyclic
paths. One problem with this approach is that, since the
number of possible k-iteration paths grows exponentially with k,
path IDs may overflow in practice already for small values of
k and very large hash tables may be required. In particular,
their profiling procedure aborts if the number of static paths
exceeds 60,000, while this threshold is reached on several
small benchmarks already for k = 3 [17]. This technique
incurs a larger overhead than BLPP: in particular, the
slowdown may grow to several times the BLPP-associated
overhead as k increases.
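
To see how quickly such a threshold can be reached, consider a simplified model, assumed here purely for illustration and not taken from the numbering scheme of Roy and Srikant: a loop body with p acyclic paths admits up to p^k distinct sequences of k iterations. The function name first_k_exceeding is hypothetical, and the default limit is the 60,000 static paths mentioned above.

```python
def first_k_exceeding(p: int, limit: int = 60_000) -> int:
    """Smallest k such that p**k exceeds limit.

    Simplified model (an assumption): a loop body with p acyclic paths
    yields up to p**k distinct k-iteration paths.
    """
    if p < 2:
        raise ValueError("p must be at least 2 for the count to grow")
    k = 1
    total = p
    while total <= limit:
        k += 1
        total *= p
    return k


if __name__ == "__main__":
    for p in (2, 10, 40):
        print(p, first_k_exceeding(p))
```

Under this model, a loop body with 40 acyclic paths already exceeds the limit at k = 3 (40^3 = 64,000), while a body with only two paths does so at k = 16.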

Li et al. [17] propose a new path encoding that does not 
rely on an inference phase to explicitly assign identifiers to 
all possible paths before the execution, yet ensuring that any 
finite-length acyclic or cyclic path has a unique ID. Their 
path numbering needs multiple variables to record probe val- 
ues, which are computed by using addition and multipli- 
cation operations. Overflowing is handled by using break- 
points to store probe values: as a consequence, instead of a 
unique ID for each path, a unique series of breakpoints is 
assigned to each path. At the end of the program's execution, the
backwalk algorithm reconstructs the executed paths starting 
from breakpoints. This technique has been integrated with 
BLPP to reduce the execution overhead, resulting in a slow- 
down of about 2 times on average with respect to BLPP, but 
also showing significant performance loss (up to a 5.6 times 
growth) on tight loops. However, the experiments reported 
in [17] were performed on single methods of small Java
programs, leaving further experiments on larger
industry-strength benchmarks to future work.

The comparison of different path profiling techniques 
known in the literature with our approach is summarized in 
Table 1.

6. Conclusions 

In this paper we have presented a novel approach to cyclic 
path profiling, which combines the original Ball-Larus path 
numbering technique with a prefix tree data structure to 
keep track of concatenations of acyclic paths across multiple 
loop iterations. A large suite of experiments on a variety of 
prominent benchmarks shows that our approach not only
collects significantly more detailed profiles, but can also be
faster than the original Ball-Larus technique by reducing the 
number of hash table operations. 
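
To make concrete what is being profiled, the following deliberately naive sketch counts every concatenation of up to k consecutive BL path IDs in a stream using a prefix tree. It is not the algorithm presented in this paper: it takes O(Nk) time on a stream of N IDs, uses no hash-table-indexed forest, and ignores routine boundaries. The names TrieNode, profile, and frequency are hypothetical.

```python
class TrieNode:
    """Prefix-tree node: count is the frequency of the path spelled by
    the BL path IDs on the way from the root to this node."""

    def __init__(self):
        self.count = 0
        self.children = {}


def profile(stream, k):
    """Count all runs of 1..k consecutive BL path IDs in stream."""
    root = TrieNode()
    for start in range(len(stream)):
        node = root
        for path_id in stream[start:start + k]:
            if path_id not in node.children:
                node.children[path_id] = TrieNode()
            node = node.children[path_id]
            node.count += 1
    return root


def frequency(root, path):
    """Frequency of a given sequence of BL path IDs, or 0 if never seen."""
    node = root
    for path_id in path:
        node = node.children.get(path_id)
        if node is None:
            return 0
    return node.count


if __name__ == "__main__":
    root = profile([1, 2, 1, 2], k=2)
    print(frequency(root, (1, 2)), frequency(root, (2, 1)))
```

On the stream 1, 2, 1, 2 with k = 2, the sequence (1, 2) is counted twice and (2, 1) once, while any sequence longer than k is not recorded at all.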

An interesting open question is how to use sampling-
based approaches such as the one proposed by Bond and
McKinley [10] to further reduce the path profiling
overhead. We believe that the bursting technique, introduced by
Zhuang et al. [22] in the different scenario of calling context
profiling, could be successfully combined with our approach,
making it possible to reduce the overhead while maintaining
reasonable accuracy in mining hot paths.

Another way to reduce the profiling overhead may be 
to exploit parallelism. We note that our approach, which 
decouples path tracing from profiling using an intermediate 
data stream, is amenable to multi-core implementations by 
letting the profiled code and the analysis algorithm run on 
separate cores using shared buffers. A promising line of 
research is to explore how to partition the data structures 
so that portions of the stream buffer can be processed in 
parallel. 

Finally, we observe that, since our approach is exhaustive 
and traces taken paths regardless of their hotness, it would be 
interesting to explore techniques for reducing space usage, 
by pruning cold branches of the k-SF on the fly to keep
the memory footprint smaller, making it possible to deal with even
longer paths.